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Abstract 

Cross-species research in drug development is novel and challenging. A bivariate mixture model utilizing information 
across two species was proposed to solve the fundamental problem of identifying differentially expressed genes in 
microarray experiments in order to potentially improve the understanding of translation between preclinical and 
clinical studies for drug development. The proposed approach models the joint distribution of treatment effects 
estimated from independent linear models. The mixture model posits up to nine components, four of which include 
groups in which genes are differentially expressed in both species. A comprehensive simulation to evaluate the model 
performance and one application on a real world data set, a mouse and human type II diabetes experiment, suggest 
that the proposed model, though highly structured, can handle various configurations of differential gene expression 
and is practically useful on identifying differentially expressed genes, especially when the magnitude of differential 
expression due to different treatment intervention is weak. In the mouse and human application, the proposed 
mixture model was able to eliminate unimportant genes and identify a list of genes that were differentially expressed 
in both species and could be potential gene targets for drug development. 
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Introduction 

Background 

Pharmaceutical medicine is an industry with huge up- 
front investment for rewards that may or may not come 
years later A complete drug development process, includ- 
ing drug discovery, preclinical research (on animals) and 
clinical trials (on humans), is lengthy, expensive, and risky. 
Determined by the US Food and Drug Administration 
(FDA) [1], the average total cost per drug development 
is about $1.9 billion. The typical development time is 10 
to 15 years. The overall attrition rate of a drug com- 
pound from first-in-man to registration is approximately 
80%-90% [2,3]. 

FDA [1] calls the preclinical and clinical research 
together as the 'critical path' development phase, where 
most investment required for a successful drug launch 
occurs. Currently, this development phase is inherently 
inefficient. The goal of preclinical research is to assess 
how a drug is absorbed, distributed, metabolized, and 
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excreted in animals, and to use the findings to deter- 
mine potential human outcomes before starting clinical 
trials. Yet the rate of success after a drug candidate 
enters Phase I is undesirably low. As mentioned in FDA 
[1] and Kola and Landis [3], animal models with poor 
clinical relevance may be accountable for this perplex- 
ity. Hence, improving translation between two species to 
increase the predictive power of animal models to human 
studies is of tremendous value to drug discovery and 
development. 

Homology and multiple species gene expression analysis 
In drug development 

Microarrays are tools for gene expression analysis and 
can be potentially useful for investigating the mechanism 
of drug activities that translates across species. The util- 
ity of microarray information in the drug development 
process is reviewed by Braxton and Bedilion [4] who 
embraced the idea that gene expression analysis can be a 
surrogate marker for the interaction between compounds 
and cells and should yield information about efficacy. 
Debouck and Goodfellow [5] believed that microar- 
rays can be used to generate clues to patterns of gene 
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function that can help improve the efficiency of drug 
development. 

As stated before, one key challenge of drug develop- 
ment is to successfully translate the results of preclinical 
findings in animal models to human beings in the clinic. 
Pre-clinical experiments assume that the effect of the drug 
tested on animals is comparable to that on humans, which 
can only be true if a functional equivalent of the human 
drug target exists in the experimental species. Orthol- 
ogy [6-8] is a strong indication of functional conservation 
and therefore provides the best functional annotation 
of experimentally undetermined genes across species. 
Holbrook and Sanseau [9] remarked that the use of 
orthologs has the potential to improve the understand- 
ing of biological differences between species (animals and 
humans). 

Many of the successful applications of cross-species 
microarray gene expression analysis involve orthology 
[10-13]. Additionally, over the past decade, researchers 
have tried to use orthology and gene expression data to 
do cross-species comparison in order to understand how 
genes interact to perform particular biological processes 
[14,15]. These studies support the idea that orthologs 
could be a useful tool for researchers to link experiments 
between species in drug development. Note that orthol- 
ogous relationships can be one-to-one, one-to-many, or 
many-to-many [8]. 

The rest of the paper is structured as follows: 
Section 'Joint modeling across species' describes the 
proposed bivariate mixture model across species. 
Section 'Simulation' describes a simulation study under- 
taken to investigate the effects of different experimental 
designs on the power to detect important genes and on 
misclassification rates. Section 'Application: the mouse 
and human type II diabetes experiment' illustrates the 
methodology using an application to data collected in a 
mouse/human experiment. Section 'Concluding remarks' 
concludes. 

Joint modeling across species 

Let Xaij and X/j.^ denote gene expression measurements 
from the ith orthologous gene pair for the jth animal and 
the Ith human. The following independent linear mod- 
els describe the association between gene expression and 
treatment: 

Xaij = Poat + PUi Taj + (1) 

^ha = fiohi + PihiThi + e/!,,. (2) 

where Ta^ and T/,^ are {0, 1} treatment indicators, and 
Caij and e;,.; are independent N{0, a^) and NiO, a^) ran- 
dom variables, cr^ and are variances for 6^,^ and e;,.^, 



respectively. In drug development, the animal research 
and human experiments are conducted independently - 
one's results do not affect the other's. However, the 
treatment effects are expected to have some kind of 
association between the two species. This results in 
our choice of using two independent models for the 
two species to capture the effects of treatment on gene 
expression. 

A nine-component bivariate mixture model for two species 
experiments 

Plat and fiihi quantify the differential expression of the 

ith orthologous animal and human genes due to a treat- 
ment intervention. A given gene can be classified as 
non-differentially expressed (NDE) - showing no signs 
of treatment effects, positively differentially expressed 
(pDE) - showing positive treatment effects, or negatively 
differentially expressed (nDE) - showing negative treat- 
ment effects. Therefore, for a human and animal gene 
pair, there are nine possibilities for categorizing this 
pair of genes. Further, dependency is assumed between 
differentially expressed orthologs, i.e., existence of asso- 
ciation posited only for {Piai> Pihi)^ in categories (1, 
2, 3, 4) and zero correlation presumed for (fiui, Pihi)'^ 
in categories (0, 5, 6, 7, 8). Table 1 illustrates the nine 
possible categories of (/Sia,, /Si/,,)^. (m Aa,- . M A^,. ) ^ is the 
vector of population means of iPui>Pihi)^ under each 
category. 

In consequence of these possible patterns of 
WiapPihi)^' mixture models [16,17] are adopted to deal 
with the correlation and distribution of each subgroup of 
genes across species. An additional advantage of mixture 
models is that, after prior weights for the components 
are specified, estimates of the posterior probabilities of 
population membership can be formed for each obser- 
vation to give a probabilistic clustering. As a result, 
the pooling of information for genes across species can 
be exploited to better understand the underlying rela- 



Table 1 Possible categories of (Pioi, fiihi)^ 


Category 






Corr(/Ji«,,/3iA,) 


0 


(NDE,NDE) 


(0,0) 


0 


1 


(pDE,pDE) 


(+,+) 


Pi 


2 


(nDE,nDE) 


(-, -) 


P2 


3 


(pDE.nDE) 


(+.-) 


P3 


4 


(nDE.pDE) 


(-.+) 


P4 


5 


(NDE.pDE) 


(.0.+) 


0 


6 


(NDE,nDE) 


(0,-) 


0 


7 


(pDE,NDE) 


(+.0) 


0 


8 


(nDE,NDE) 


(-0) 


0 
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tionship between the treatment intervention for both 
species. 

Tailoring the mixture model to two-species experiments 

with restrictions on the parameters made according to 
Table 1 and assuming that the treatment effects for non- 
differentially expressed genes are deterministically zero, 
i.e., (Plait Pihi)^ = (0' 0)"^' the following bivariate normal 
mixture model is adopted as the prior distribution of the 
vector (fiui.fiihi)^- 



Plhi 



' TtoN 

+ 7riN 

+ 7T2N 

+ 7l-iN 
+ 7tiN 

+ JtsN 
+ 7teN 
+ jtjN 
+ 7rsN 



MAO 



fJ-ai 



\Pir]aimi nil 



{ '?a4 Piriairjhi 
\ PirjaiVhi 

( '?a5 PSVaSmS 
\P5ria5Vh5 vis 



lie 



^a8 P&^aSnm 
PSiaSmS ^Is 



(3) 



where Jtj^ is the probability that an observation belongs to 

the ^h component, with Ylk=o — ^ and jt/^ > 0. The 
following restriction of the parameter space is imposed: 

f^aO = 0, jXhO = 0, flal > 0, flhl > 0, IJ.a2 < 0, 

I^h2 < 0. M«3 > 0, ^lh3 < 0, flai < 0, Mft4 > 0, flaS = 0, 

MAS > 0, /Xfl6 = 0, fli,6 < 0, fla7 > 0, flh7 = 0, 

Mfl8 < 0, Hhs = 0, tJaO = 0, rjho = 0, r]a5 = 0, tja6 = 0, 

f]h7 = 0, r]h8 = 0, po = 0, P5 = 0, P6 = 0, P7 = 0, and 

P8 = 0. 

According to the theory of least squares, the marginal 
distribution of (Piai> fiihi)^> the parameter estimates, has 
means equal to the prior means of (fiui, Pihj)^ and vari- 
ances involving contributions from the prior distribu- 
tion of (Plait Pihi)^ and the conditional distribution of 



(Piai, Pihi)'^ given (Piap Pihi)'^ ■ The marginal distribution 

of (Pui, Pihi V is as follows: 



Plai 
Plhi 



TTqN 



+ niN 



+ 7t2N 



+ 7t3N 



+ 7t4N 



0 a, 



ho 



fJ-al 
fJ-hl 

Ma2 
fJ-h2 

P^a-i 
Ma4 

0 

MAS 

0 

MA6 
0 

Ma8 

0 



a 



al 

Pl^alOhl 



a 



a2 



PlOalOhl 



P2<^a2(^h2 
2 



P2<ya2(^h2 f, 



A2 



cr 



a3 P3CT«3CTA3 
2 



P30'«30'A3 f, 



A3 



cr 



«4 



P4(Ta40^A4 
2 



0 < 



0 < 



0 



0 CT 



(4) 



An EM algorithm is developed to accomplish the nontriv- 
ial likelihood maximization, along with methodology for 
handling singular covariance matrices that arise during 
the implementation of the algorithm. (See the Appendix 
for details). Gene membership is determined according 
to the maximum posterior probability that an observa- 
tion (PuiiPihi)^ comes from the Arth component of the 
mixture. 

Simulation 

The following Monte Carlo simulation studies investi- 
gated the performance of the proposed mixture model 
using information across two species in comparison to 
the traditional microarray method using just one-species 
information when identifying genes associated with treat- 
ment stimulus under several different scenarios. 

Several factors influence the sampling properties of the 
estimated treatment effects (Pui, Pihi)^ using the mixture 
model were considered in the simulation: 

• Replicates (number of arrays) per treatment for each 
species: ria and for animals and humans, 
respectively. 

• Number of orthologous genes in each category: n^, 
/: = 0, . . . , 8, the /rth category. 
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• Array noise: e^^y and e/,.. in (1) and (2). Also recall 
that, by assumption, Caij and e/,.j are independent 
N(0, (T^) and N(0, a^) random variables. 

• Parameters in (3) by which the sampling distribution 
of (/Siaj, fiihi)^ is determined. 

With so many variables, it is impractical to study the 
sampling properties of the fitted model without fixing 
some variables. That is, an experimental design for the 
simulation in which these factors are completely consid- 
ered is not feasible. The simulation study instead focused 
on three aspects. First, although high-density microarrays 
provide useful genome-wide data, they are often associ- 
ated with a substantial amount of experimental noise that 
could affect the performance of the analysis. Hence, it is 
of interest to investigate how the array noise would affect 
the model efficiency on gene identification across species. 

Second, the sample size of cross-species experiment is 
likely to be different, and may be one of the deciding fac- 
tors of the power associated with the modeling approach. 
In particular, the efficiency of gene identification, whether 
the proposed model gains power over one-species experi- 
ments through pooling information across species, should 
be examined carefully, especially when the sample size of 
the experiments is small. 

Third, over-fitting may be of concern. The proposed 
mixture model is, by its nature, highly structured and data 
driven. If the data are not driven by all nine categories as 
the model suggests, will the mixture model fail? Is the pro- 
posed model flexible enough to handle different types of 
data structure? To examine the model performance sys- 
tematically and to test if the proposed model will fail when 
too many components are used to fit the data where there 
are actually fewer clusters, two types of data were gener- 
ated: all nine categories non-empty (case I) and some of 
the nine components empty (case II). 

In simulation studies case I and case II, two methods 
of gene identification were implemented: the proposed 
mixture model, utilizing information across two different 
species, and the traditional t statistics adjusting for mul- 
tiple comparisons based on single-species data. Five hun- 
dred data sets were generated for each different scenario 
under each simulation study. 

Parameter determination and data generation 

Theoretically, the number of genes in category 0 (non- 
differentially expressed in both species) should dominate 
others, and every other category may comprise some 
genes. Orthology information from HomoloGene of the 
National Center for Biotechnology Information (NCBI) 
(http://www.ncbi.nlm.nih.gov/homologene) and Mouse 
Genome Informatics (MGI) of Jackson Laboratory [18] 
and the practical experience gained from analysis of two- 
species gene expression experiments at GlasoSmithKline 



(GSK) was used as reference to determine a reasonable 
number of genes in each category. 

The vectors of the number of genes in each cate- 
gory (mq, «i, . . . , Mg)^ determined for simulation stu- 
dies case I and case II are categories (6, 000, 30, 
30, 30, 30, 100, 100, 100, 100)^ and categories (6, 000, 30, 
30,0,0,100,0,100,0)^, respectively. For experiments 
across species, sample sizes may differ. Considering 
that this proposed bivariate method could benefit from 
pooling information across species, especially when the 
sample size is small, and the practical situation, two sce- 
narios were implemented: the number of replicates per 
treatment for each species is equal and small, and the 
number of replicates per treatment for animals is greater 
than humans. In addition, to evaluate how robust the 
proposed method is against array noise, two situations 
were considered: the two experiments are equally noisy 
and the human data are noisier than the animal's. Fur- 
thermore, values of parameters in (3) for the sampling 
distribution of (fiiai> Pihi)^ were predetermined. Vari- 
ances of (Plait Pihi)^ in each component were assumed 
to be the same: iij^ = r]l^ = r^^ = r]l^ = iij-j = r]l^ = 
'?W = ^2 = lis = "Im = ^5 = ^6 = The corre- 
lation between iPui>Pihi)^ i^i categories (1,2,3,4) was 
assumed to be 0.9, i.e., pi = P2 = Ps = Pi = 0.9. Nonzero 
component means (fiah H-hk) ^> ^ = 0, . . . , 8, were deter- 
mined so that \iJ./r]\ = 0.5 or 1.5. The combination of 
these parameters resulted in eight different scenarios for 
each case as presented in Table 2. Note that IfJ^^i^.l and 
represent the absolute value of the mean vector of 

WlarPlhif- 

After generating (fiiapPihi)^ accordingly, the next step 
is to simulate the two species gene expression data Xa 
and based on linear models (1) and (2). Note that /Joa; 
and Pohi are independent N(8, 1) random variables for 
differentially expressed genes and deterministically 0 for 
non-differentially expressed genes. 

Simulation results 

It is of interest to compare how effective the mixture 
model is on gene identification using information across 
two species with the conventional one-species approach. 
The conventional two-sample t test for gene selection 
was performed using just single species data (animals 
or humans) and a multiplicity adjustment was made 
according to the procedure proposed by Benjamini and 
Hochberg [19]. 

The results are presented in Table 3. The first section 
of Table 3, categories (1, 2, 3, 4, 5, 6), manifests the num- 
ber of genes classified into categories (1, 2, 3, 4, 5, 6) 
using the the mixture model (Mixture) and the number of 
genes selected using human data only (Human only), with 
the corresponding nominal FDR controlled at FDR/. For 
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Table 2 Combination of parameters for simulation studies case I and case II 



Casel 


sim1 




sim2 sim3 


sim4 


simS 


sim6 




sim7 


sim8 


"o 


10 




10 10 


10 


100 


100 




100 


100 


nh 


10 




10 10 


10 


10 


10 




10 


10 




0.25 




0.75 0.25 


0.75 


0.25 


0.75 




0.25 


0.75 




0.25 




0.75 0.25 


0.75 


0.25 


0.75 




0.25 


0.75 




0.1 




0.1 0.1 


0.1 


0.1 


0.1 




0.1 


0.1 




0.1 




0.1 0.3 


0.3 


0.1 


0.1 




0.3 


0.3 


Case II 


sim9 




simlO simll 


siml2 


sim13 


Siml4 




sim15 


sim16 


each simulated data set, FDR/ was calculated as (number alone (Animal only). Beneath each set of eight simulation 
of genes that are erroneously classified into categories cases is Tukey's Honestly Significant Difference (HSD) for 
(1, 2, 3, 4, 5, 6))/(total number of genes classified into a familywise error rate of 0.05 for the results obtained 
categories (1, 2, 3, 4, 5, 6)). This was to ensure a fair com- using the proposed mixture model. Tukey's HSD was 
parison between the mixture model and the conventional calculated as qo.osi^, 3, 992) x (MS(Error)/500)^''^, since 
one-species method. FDR// was calculated in the same there were eight simulation cases (500 simulated data 
fashion. Avg FDR/ and Avg FDR// are simply the aver- sets in each situation) in each simulation study (case 
aged values of FDR/ and FDR// across the 500 simulated I and case II) and the error degree of freedom was 
data sets. When the estimated nominal FDR = 0, i.e., 3,992. MS(Error) denotes the error mean square (= Error 
the mixture model did not falsely categorize any genes, sum of squares/Error degree of freedom, see Table 4). 
the nominal FDR for the single-species method was con- g'o.osCS, 3, 992) = 4.29. 

trolled at 0.0001. The second section of Table 3, categories The effect of array noise on the mixture model can 
(1, 2, 3, 4, 7, 8), represents the number of genes classified be easily seen from the column of Avg FDR/ and by 
into categories (1,2,3,4,7,8) using the mixture model comparing the results of siml vs. sini3, sim5 vs. sim7, 
(Mixture) and number of genes selected using animal data sim9 vs. simll, and siml3 vs. siml5. These are the cases 


Table 3 The number of genes selected based on (a) bivariate mixture model, (b) conventional one-species approach 








Categories (1,2,3,4,5,6) 






Categories (1,2,3,4,7,8) 






Mixture 


Avg FDR/ 


IHuman only 


lUixture 


Avg FDR// 


Animal only 


siml 




132 


0.034 


1 00(0.070) 


129 




0.029 




96(0.063) 


sim2 




224 


0.003 


145(0.021) 


223 




0.012 




188(0.016) 


sim3 




113 


0.246 


85(0.318) 


135 




0.048 




109(0.073) 


sim4 




166 


0.050 


115(0.043) 


222 




0.012 




188(0.016) 


simS 




132 


0.028 


98(0.041) 


234 




0.004 




238(0.012) 


sim6 




227 


0.011 


194(0.021) 


289 




0.003 




289(0.007) 


sim7 




112 


0.235 


78(0.282) 


241 




0.011 




246(0.020) 


sim8 




167 


0.048 


124(0.065) 


288 




0.002 




288(0.007) 


Tukey's HSD 




30.908 


0.021 




52.958 




0.016 






sim9 




92 


0.126 


81(0.296) 


88 




0.129 




75(0.307) 


simlO 




118 


0.033 


104(0.048) 


118 




0.036 




99(0.061) 


siml 1 




141 


0470 


109(0.670) 


80 




0.128 




70(0.214) 


siml2 




103 


0.116 


68(0.118) 


118 




0.049 




102(0.088) 


simlB 




84 


0.140 


67(0.169) 


180 




0.349 




1 67(0.407) 


siml4 




120 


0.023 


88(0.023) 


152 




0.022 




151(0.101) 


siml5 




119 


0480 


99(0.636) 


168 




0.358 




157(0.369) 


siml6 




96 


0.093 


57(0.105) 


147 




0.020 




148(0.061) 


Tukey's HSD 




3.947 


0.011 




2.337 




0.004 







Under simulation studies case I and case II. Numbers in parentheses are the observed FDRs. Averaged over the 500 simulated datasets. Tukey's HSD for an a level of 
0.05 is Included beneath each set of eight simulation cases. 



Su etal. Human Genomics 2014, 8:12 
http://www.humgenomics.eom/content/8/1 /1 2 



Page 6 of 12 



Table 4 ANOVA table to quantify variability 









Categories (1,2, 3,4,5, 6) 


Categories (1, 2, 3, 4, 7, 8) 




Source of variation 


df 


Sum of squares 


Sum of squares 


Casel 


Replicates 


1 


1,141(0.006) 


7,319,401(0.408)* 




Mean magnitude 


1 


5,415,341(11.560)* 


4,982,736(0.252)* 




Array noise 


1 


1,561,198(15.950)* 


6,353(0.042) 




Replicates x Mean magnitude 


1 


1,197(0.032) 


392,099(0.122)* 




Replicates x Array noise 


1 


697(0.012) 


351(0.011) 




Mean magnitude x Array noise 


1 


400,720(7.021)* 


16,601(0.043)* 




Replicates x Mean magnitude x Array noise 


1 


145(0.002) 


214(0.010) 




Error 


3,992 


1,689,667(12.887) 


592,184(1.817) 




Total 


3,999 


9,070,107(47.471) 


13,309,941(2.706) 


Case II 


Replicates 


1 


73,917(0.006) 


3,676,664(2.888) 




Mean magnitude 


1 


6(56.346) 


22,274(2.830) 




Array noise 




1 30,794(43.770)* 


865,448(0.001)* 




Replicates x Mean magnitude 




37,277(0.190) 


36,778(1.038) 




Replicates x Array noise 




34,404(0.015) 


5,065(0.049) 




Mean magnitude x Array noise 




929,915(17.537) 


19,128(0.043) 




Replicates x Mean magnitude x Array noise 




1 ,467(0.004) 


58(0.000) 




Error 


3,992 


103,604,971(47.182) 


304,161,196(27.427) 




Total 


3,999 


104,812,751(165.052) 


308,786,610(34.276) 



ANOVA was performed independently for simulation studies case I and case II to quantify the variability among the results (gene counts and observed FDRs). Numbers 
in parentheses are the sum of squares for the observed FDRs. For gene counts, "indicates which sources of variability could be declared significant at level a = 0.05. 



with smaller means and variances for humans which 

change from 0.1 to 0.3 for each pair of comparisons. 
The differences between the observed FDRs for these 
four groups were at least 0.2 and 0.3 for case I and 
case II, respectively. In contrast, when means were larger, 
changes of variances did not seem to affect the results 
in the sense that the corresponding observed FDRs had 
barely changed while the variances of human increased. 
The observed FDRs for animals were not as sensitive to 
the array noise on humans as the observed FDRs for 
humans. 

Under both simulation studies case I and case II, 
increasing the number of replicates in the animal exper- 
iment helped the gene identification for animals: more 
animal genes were identified for sim5 to sim8 than for 
siml to sim4, and for siml3 to siml6 than for sim9 to 
siml2. The corresponding FDRs were also lower for sim5 
to sim8 and siml3 to siml6. In contrast, increasing the 
number of replicates in the animal experiment did not 
significantly improve the results of gene identification for 
humans. 

Table 4 is the summary of a three-way analy- 
sis of variance (ANOVA). Three factors, each with 
two levels, were used in the analysis: replicates 
= (10,10)^ or (100,10)^), mean magni- 
tude = (0.25,0.25)^ or (0.75,0.75)^), 



and array noise {{(rj.al)'^ = (0.1,0.1)^ or (0.1,0.3)^). 
This analysis, performed independently, quanti- 
fies the variability among the results (gene counts 
and observed FDRs) obtained using the proposed 
mixture model in Table 3 for the 16 different simu- 
lated situations under simulation studies case I and 
case II. 

Throughout the 16 simulation cases, with nominal FDR 
controlled at FDR/, the bivariate mixture model outper- 
formed the single-species method for human gene iden- 
tification by always recognizing more genes with lower 
observed FDRs. For the animal part, the mixture model 
performed at least as well as the single-species method 
on gene selection by identifying at least as many genes. 
Notice that selecting genes related to humans (categories 
(1, 2, 3, 4, 5, 6)) seemed to be associated with higher false 
discovery rate than selecting genes related to animals (cat- 
egories (1, 2, 3, 4, 7, 8)). Furthermore, the observed FDRs 
were lower for case I than for case II, regardless of the type 
of genes interested (differentially expressed for humans or 
animals). 

The comprehensive simulation study suggested that the 

proposed model, though highly structured, offered advan- 
tages over single-species analyses, especially when the 
magnitude of differential expression due to different treat- 
ment intervention was weak. 
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Application: the mouse and human type II diabetes 

experiment 

Background introduction 

A systems biology study was completed by GlaxoSmithWine 
(GSK) to study the efficacy of type II diabetes drugs 
in both preclinical (mice) and clinical (humans) experi- 
ments. The mouse and human data were collected and 
preprocessed using Affymetrix MAS 5.0 at the probe set 
level on Affymetrix MOE430A array and Human Genome 
U133 Plus 2.0 array, respectively. For mice, the total num- 
ber of probe sets was 22,690. Mice were fed a diet enriched 
in fat (58% kcal from fat) for 8 weeks prior to treatment. 
Most of the mice of this susceptible strain developed 
obesity and mild hyperglycemia and hyperinsulinemia. 
Control mice on an 11% low fat diet remained normal. 
The mouse treatment arm consisted of a diabetes drug at 
multiple dose levels with vehicle controls over a 2-week 
period. The study was a full factorial design, where 40 ani- 
mals in high fat diet and 40 animals in low fat diet were 
randomized to receive either placebo, or different dosages 
of the type II diabetes drug (low, medium or high). Note 
that the results for mice were measured at one time point, 
the end of the study. For simplicity, only integration results 
on animals treated with placebo (10 mice) or high dose of 
the drug (9 mice) were demonstrated. 

On the other hand, there were 54,676 probe sets in the 
human data set and all 59 subjects were type II diabetes 
patients. The gene expression measurements for human 
subjects were collected twice during the experiment, one 
at baseline before the treatment started (week 0) and the 
other one at the completion of the study (week 8). Among 
the 59 subjects, 14 only had data for only one time point 
and hence were not included in further data analysis. Sub- 
jects in the clinical trial were treated by either placebo, or 
three other type II diabetes drugs, including the one given 
to the mice. Titrated dosing was implemented to ensure 
that each person received his/her dose based on his/her 
body profile. Human subjects treated with placebo (num- 
ber of subjects = 11) and the same type II diabetes drug 
given to mice (number of subjects = 13) were used in this 
analysis. Besides being measured at two time points, other 
information contained in the human data and used in the 
data analysis included: prior therapy (four categorical lev- 
els) and concomitant medication status (two categorical 
levels). 

Data for both species were logarithmic transformed 
(base equal to 2). Methods for combining information 
from multiple probe sets that were not identical have been 
discussed in many publications [20,21]. A gene-level tran- 
script value in order to pair the mouse and human genes 
through orthology was obtained by averaging probe sets 
across a gene. This resulted in 13,483 and 20,252 genes 
for mouse and human, respectively. Missing values were 
replaced with array means in both data sets. 



The mouse and human orthology information, MGI 
release 4.32, was used to map the mouse and human 
orthologous genes. Human and mouse orthologs (17,834) 
were included in release 4.32. Combining the information 
among MGI release 4.32 and mouse and human genes 
from the GSK data gave a total of 11,922 orthologs. 

In addition to the measurements of gene expression for 
both mice and humans, different efficacy endpoints, such 
as blood glucose, insulin, hemoglobin Ale (HbAlc), and 
others were also measured during the preclinical and clin- 
ical experiments by GSK in order to evaluate the effect of 
treatment intervention on both species. 

The purpose of this data analysis was to evaluate the 
capability of the proposed mixture model, an approach 
utilizing the information across two species, to identify 
genes that may help scientists reveal the biological simi- 
larity between two species (ex: mouse and human), essen- 
tially genes in categories (1, 2, 3, 4), and so to improve the 
efficiency of drug development by decreasing the com- 
pound attrition rate from preclinical trials to clinical trials. 

Data analysis 

The estimated treatment effects for mice and humans 
are obtained by the following independent simple linear 
models: 

^aij = Poui + Plai Taj + efl,y. (5) 

XI = Poh, +^ft7*" + + PriorTherapy 

-I- /63fe;ConMed + eh^, (6) 

where / = 1, . . . , 11, 922 (number of orthologs); / = 
1, ...,19 (number of mice); / = 1, ...,24 (number of 
humans). X„,., and X^^^"''^ are gene expressions 

from the ith orthologous gene for the y'th mouse, gene 
expression from the ith orthologous gene for the /th 
human before treatment intervention, and gene expres- 
sion from the ith orthologous gene for the Ith human at 
the completion of the clinical experiment, respectively. 
Ta^ and T/^^ are {0, 1} treatment indicators, and 6^,^ and 
are independent N(0, a^) and N(0, a^) random vari- 
ables. Additionally, there are two more covariates in the 
human model: PriorTherapy (a four-level categorical vari- 
able indicating patients' therapy prior to the clinical trial) 
and ConMed (a two-level categorical variable for con- 
comitant medication status). 

Equation (4) was used to model the distribution of the 
estimated treatment effects. Genes in categories (1, 2, 3, 
4) are believed to be potential biomarkers that can greatly 
improve the design of the process of drug development as 
these orthologous genes interact with drugs in a way that 
shows some relationship between two different species (in 
this case, human and mouse) and so studying the behav- 
ior of these genes in preclinical trials might help scientists 
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better understand the mechanism of drug activities in 
clinical trials. 

Results 

Parameter estimation 

The maximum likelihood estimates of the parameters in 

the bivariate normal mixture model using the EM algo- 
rithm are given in Table 5. The estimated mixture weight 
for category 0 was tcq = 0.889 indicating approximately 
10,599 (11,922 X 0.889) pairs of uninteresting mouse 
and human orthologs. (i^a, denotes the mean vec- 
tor of each mixture component. The estimated treatment 
effect means of mice were in general larger than that of 
humans, indicating that the magnitude of the difference 
of expression in genes due to treatment intervention in 
mice tended to be larger than in genes where differential 
expression is exhibited in humans. The estimated variance 
for mice CT^^ tended to be larger than that for humans ct^^, 
suggesting that overall, the variability in the mouse data 
set was larger than the human data set. The estimated 
correlation coefficients between the treatment effects for 
both species were (—0.531, 0.380, —1, 0.102) for categories 
(1, 2, 3, 4), respectively. The estimated correlation for cat- 
egory 3 was based on only two observations. Standard 
errors of the parameter estimates from the two-species 
experiment were obtained by bootstrapping with 1,000 
bootstrap replicates. Based on the bootstrap standard 
errors, there was little evidence of bias for the parameter 
estimates. 

Gene identification 

The vector of the number of genes identified for each 

category was {no, hi, . . . , = (10,814, 41, 12, 2, 20, 
168, 578, 12, 275)^. Figure 1 displays the scatter plots of 
(PiapPihi)^ before and after gene membership identifi- 
cation, including the scatter plot of (Piaii Pih,)^ for 
orthologs, orthologs after eliminating the uninteresting 



ones (orthologs not in category 0), and orthologs reacting 
to the treatment stimulus for both species (orthologs in 
categories (1, 2, 3, 4)). 

Based on the bivariate mixture model, among the 11,922 
pairs of orthologs, 10,814 pairs did not react to the drug 
treatment for either species, 75 pairs of orthologs (sum 
of the gene counts in categories 1 through 4) showed 
evidence of differentiation between treatments for both 
species. Genes in categories (5,6,7,8) are also poten- 
tial candidates for further investigation to improve the 
process of drug development since studying these genes 
might uncover the myth of the overall high attrition rate of 
a drug compound from preclinical trials to clinical trials. 

In comparison, an attempt was made to identify differ- 
entially expressed human genes using solely the human 
data, i.e., traditional t statistics were used to test whether 
or not Pih- = 0 and the approach of Benjamini and 
Hochberg [19] was used to adjust for multiple compar- 
isons. With the nominal FDR controlled at 0.01, 0.1, 0.5, 
and 0.9, this single-species method failed to identify any 
differentially expressed human genes at any levels of FDR. 
With the p values histograms in Figure 2 showing an 
obvious difference between the observed significance of 
differential expression for mice and humans, specifically, 
the p values were nearly uniformly distributed for humans, 
the result was not surprising. 

Concluding remarks 

This research was motivated by a fundamental yet still 
not well-understood problem in the drug development 
process. The results obtained in preclinical animal trials 
do not seem to translate well enough to make inferences 
for human clinical trials, resulting in an undesirably high 
attrition rate in human experiments. A bivariate mixture 
model which utilizes information across two species was 
proposed to identify genes that exhibit similar patterns 
of expression across species, with the hope that studying 



Table 5 Parameter estimates of the bivariate mixture model 



Parameter estimates 



Category 














0 


0.889(0.013) 


NE 


NE 


0.117(0.002) 


NE 


0.052(0.001) 


1 


0.002(0.001) 


1.440(0.190) 


0.139(0.203) 


0.112(0.064) 


-0.070(0.040) 


0.154(0.061) 


2 


0.001(0.001) 


-1.174(0.228) 


-0.853(0.128) 


0.156(0.132) 


0.033(0.053) 


0.048(0.039) 


3 


0.000(0.001) 


2.104(0.319) 


-1.231(0.361) 


0.006(0.124) 


-0.005(0.078) 


0.003(0.111) 


4 


0.002(0.001) 


-1.047(0.081) 


0.660(0.128) 


0.027(0.018) 


-0.005(0.023) 


0.087(0.047) 


5 


0.021(0.001) 


NE 


0.669(0.003) 


0.144(0.000) 


NE 


0.050(0.006) 


6 


0.049(0.002) 


NE 


-0.787(0.001) 


0.149(0.000) 


NE 


0.055(0.004) 


7 


0.003(0.008) 


1.528(0.310) 


NE 


0.543(0.201) 


NE 


0.084(0.022) 


8 


0.034(0.009) 


-1.038(0.146) 


NE 


0.225(0.051) 


NE 


0.055(0.006) 



Bootstrap (B = 1 ,000) standard errors in parentheses. NE, not estimated. 
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Figure 1 Scatter plots of the estimated treatment effects {^imi^ihi)^ before and after gene membership identification. From left to right: 
all orthologs, orthologs differentially expressed in either species (categories (1, 2, 3,4, 5,6, 7,8)), and orthologs differentially expressed in both 
species (categories (1,2,3,4)). 



genes could help understand biological differences across 
species at the molecular level and ultimately help reduce 
attrition in drug development. It is also of great interest 
to identify genes active in animal, but not in human since 
studying this group of genes might lead to answers that 
explain why some drug trials fail in translation to humans. 
The comprehensive simulation study suggested that the 
proposed model, though highly structured, can accommo- 
date various configurations of differential gene expression, 
especially when the magnitude of differential expression 
due to different treatment intervention was weak. 

In the application of the bivariate mixture model on 
the GSK type II diabetes experiment, the mixture model 
was able to separate differentially expressed genes from 
non-differentially expressed genes. A potential multi-gene 
predictor may be developed according to the genes iden- 
tified by the bivariate mixture model to benefit patients in 
therapeutic decision making. 

The mixture model is highly structured, with strong 
but somewhat simplifying assumptions. The grouping of 
all genes for which the expression difference is positive 
in both species into a single category parameterized by 



a single bivariate mean may be somewhat of a simplistic 
approach. In practice, the data may not be normally dis- 
tributed or be comprised by exactly nine groups and lead 
to bias and inefficiency. Forcing the normality assump- 
tion and the grouping may be inefficient. However, by 
modeling the least squares estimates of the expression 
differences as bivariate realizations from a distribution 
with a single mean vector, some flexibility in the model is 
retained. It is perhaps less important to precisely quantify 
the magnitude of the expression difference than to deter- 
mine whether it is positive or negative and whether or not 
the direction is preserved across species. 

Using the proposed mixture model for gene identifi- 
cation is completely data-driven. The simulation study 
in Section 'Simulation' indicated that the mixture model 
at times report a poor observed false discovery rate due 
to large variability in measurement of expression. Cur- 
rently, gene membership classification is determined by 
maximizing the posterior probability that observation 
belongs to the /cth cluster It is possible to choose costs 
to attach weight to different types of misclassification and 
then choose a classification rule to minimize expected 



0.4 0.6 
p-values 



0.4 0.6 
p-values 



Figure 2 Histograms of /i values from tests of no treatment effects. From left to right: p value histogram for mice and p value histogram for 
humans. 
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cost. This rule may lead to a more desirable list that can 
better accommodate goals for future research. 

The mixture model currently only handles data with two 
treatments/cancer types/drugs, i.e., data containing only 
two-level variables. In practice, it is often the case that an 
experiment is run under multiple conditions. An exten- 
sion of the model to handle experiments with factors with 
three-or-more levels is an important future work. 

This approach focuses on identifying orthologs that 
show same/opposite mechanism between two species. 
However, genes identified by the bivariate mixture model 
(genes in categories (1, 2, 3, 4)) may not lead to the most 
powerful model for prediction of cancer types/response 
status for either one of the species. The most power- 
ful model may be based on those human genes that 
lie in categories 5 and 6. For those genes, the cor- 
responding animal genes show no signs of differential 
expression. Hence, incorporating prediction ability into 
the model or developing a prediction model that can 
help utilize the genes selected by the mixture model is 
of great interest and is an obvious candidate for future 
research. 

Appendix 

Mixture models and the EM algorithm 

Estimating the parameters in (4) is nontrivial. Redner and 
Walker [22] offer an excellent review of estimating the 
parameters which determine a mixture density. In partic- 
ular, the paper is devoted to a particular iterative proce- 
dure for numerically approximating maximum likelihood 
estimates (MLE) of the parameters in mixture densities. 
This method was formalized by Dempster et al. [23] and 
termed the EM {Expectation-Maximization) algorithm, 
and is used for numerically approximating the maximum 
likelihood estimates for (4). 

The EM algorithm with no constraints 

For a finite mixture model with C components, given data 

y with independent multivariate observations y^, • . . ,y^, 
each Y^ is taken to be a realization of the mixture probabil- 
ity density function, 



f(yi\^) = Y.''kfkiyi\9k), 



k=l 

where * = (Oi, . . . ,0c,7ti, . . .,7tc)^, a vector of un- 
known parameters./^ and are the density and param- 
eters of the kth component in the mixture, respectively. 
Ttk is the probability that an observation arises from the 
Ath component. Note that > 0 and X!F=i ^k = 
1. For classification purpose and to achieve minimum 
misclassification rates, y,- is assigned to the population 
(category) for which the posterior probability that y, 
belongs to the kth cluster (the Ath component of the 



mixture) is maximized. The posterior probability is given 
by 



rk(Yu'^) = 



^kfkiYiWk) 



In summary, the EM algorithm may be implemented to 
maximize the likelihood of a multivariate normal mixture 
model by following these two steps: 

E-step: The E-step on the (j+l) iteration takes the con- 
ditional expectation of the complete-data log likelihood, 

given the observed data (Qi'^ >'')). 
At the (y + 1) iteration, the E-step results in 

g c 

Q(*|*(^)) = ;^^r^(y,.;*(^')(log7r;t+log/;t(y,l^;t)), 
(=1 k=l 

(7) 

where /<:(y;|«>A:) = (2;r)-2 |i:<:r2e-i(y'-''*)''^*'(y'-''*^ 



and T<:(y,.; = 



UK 



M-step: The M-step on the (j+l) iteration requires the 
global maximization o/'Q('I'|*'^') with respect to * over 
the parameter space to give the updated estimate 



HU^kiyu'^^''^) 



^k 



(8) 



(9) 



. (10) 



The iterations of the EM algorithm continue until some 
stopping criterion is met, such as the difference of the con- 
ditional expectation of the complete-data log likelihood 
at the (y + 1) step and the conditional expectation of the 
complete-data log likelihood at the y step is sufficiently 
small, i.e.. 



Q($(7+l) I $(/■)) _ Q($0') I $(;')) 



< e. 



Throughout this research, e = 0.0001. 
The EM Algorithm with constraints 

With constraints on the parameter space, the EM algo- 
rithm derived in Section 'The EM algorithm with no 
constraints' cannot be used directly. To accommodate the 
restricted parameter space introduced for (4), first note 
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that (4) is a bivariate normal mixture with nine compo- 
nents and the density for each component can be written 
as 

fk(YtWk) 

where = (yai.yhi)^ = (film, Pihi)^ and a: = il^a^'l^h' 

The solution for the E-step remains the same. The 
membership probability Tk(^i, 9^'^) can be obtained by 
taking the conditional expectation of the complete-data 
log likelihood, given the observed data. 

As for the M-step, the MLE for Tti^ remains the same, 
j^O+i) ^ Efci^^fr,-;»'^>) _ However, the MLEs for /t^ and 
need to be modified according to the constrained 
parameter space for each mixture component. 

Essentially, the MLEs on the (/ -|- 1) iteration for 
il^ah l^hk) ^, = 1, . . . , 4, are 



The MLEs of E/., k = 1, . . .,4, at the (; -|- 1) M-step 
remain unchanged as in (10). 

The MLEs on the (/ -|- 1) iteration for (fiah t^hk)^> ^ = 
5, . . . , 8, are 



- O'+i) 



0 otherwise. 



0 



otherwise. 



0 otherwise. 
0 otherwise. 



r,ij+i) 





\ Eil 






Ei 

0 






[ Eti 


Ti(y,;* 




Ei 

0 


iri(y,.;*W) 




[ Eii 






Ei 

0 


1 t2(yi;*«) 




[ Ei, 


T2(y,.;*(/))j,^. 




Ei 

0 


1 TiiYi-M'^) 




f Eii 


r3(y,;*<''):c«,- 




Ei 

0 


1 r3(y,;*«) 




[ Eii 


r3(y,.;*(/))y;,. 




Ei 

0 


1 t3(y,;*<>)) 




[ Eii 


T4(y,;*</))y„_^ 




Ei 

0 


1 Ti(Yi-M'^) 




[ Eti 






Ei 

0 


1 t4(yi;*'^'>) 



ifEf=iti(yi;*^^')j«,>0; 

otherwise, 
otherwise. 

ifEf=it2(y,-;*^''^)j«,<0; 

otherwise. 

ifEf=it2(yi;*(^'^)7/;, <0; 
otherwise. 

ifEf=i73(y,;*<^^)j«,>0; 

otherwise. 

ifEf=i^3(yi;*^^')7/z,- <0; 

otherwise. 

ifEiir4(y,-;*<^^)j«,<0; 
otherwise. 

ifEf=i74(y;;*('^)j;,>0; 
otherwise. 



The MLE for T,i;,k = 0,5, 6, 7, 8, on the (/ -|- 1) iteration is 



( Eii ^k(Y(Mhy\ 
Ef=ir*(y,-;**^') 



V 



0 



0 

Y:UrkiYiM'')yi, 

Eil r„(y,;*<'>) / 



Regularized covariance matrices in tlie EM algoritlim 

The component covariance matrices of a mixture model 
may be singular or near singular in the EM iterative pro- 
cess. When the covariance matrices corresponding to one 
or more components are ill-conditioned (singular or near 
singular), the EM algorithm breaks down. Particularly, 
applying the EM algorithm for a mixture model with large 
numbers of components when there are actually fewer 
groups often results in the failure of the EM algorithm 
due to ill-conditioning [24,25]. Indeed, this break-down 
of the EM algorithm may imply that clusters contain 
insufficient observations and too many components are 
used to fit the data set where there are actually fewer 
clusters, or clusters contain points that are of very little 
variation compared to other clusters. Hence, an intu- 
itive solution to this is to decrease the number of the 
mixture components. However, this immediately leads to 
another question: how many clusters are needed. Though 
an active area of research, it is beyond the scope of this 
study. Nonetheless, various approaches have been pro- 
posed to generate numerically non-singular covariance 
matrices [26-33]. Among such, the regularization method 
proposed by Sato and Ishii [32] has been adopted to obtain 
numerically non-singular covariance matrices throughout 
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this research. In Sato and Ishii [32], the regularized covari- 
ance matrix for the /cth mixture component in the (/ + 1) 
M-step is 



^ J (/■+!) 



■•Rk 



■a ^ Ip, 



(11) 



where 0 



a < 1 is a small constant and Ip is a 

/^-dimensional identity matrix. If 5^^'^^^ equals 0, then 

tr(E^''^^^) is set to be a small threshold value (0.0001 in 
this research). 
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